DOCUMENT RESUME

ED 131 122                                             TM 005 848

AUTHOR        Roberts, A. O. H.
TITLE         Foibles and Fallacies in Educational Evaluation.
PUB DATE      Apr 76
NOTE          23p.; Not available in hard copy due to marginal
              legibility of original document; Paper presented at
              the Annual Meeting of the American Educational
              Research Association (60th, San Francisco,
              California, April 19-23, 1976)

EDRS PRICE    MF-$0.83 Plus Postage. HC Not Available from EDRS.
DESCRIPTORS   Achievement Gains; Analysis of Covariance;
              *Compensatory Education Programs; Criterion
              Referenced Tests; Educational Innovation; Elementary
              Secondary Education; *Evaluation Methods; Federal
              Programs; Norms; *Program Evaluation; Research
              Design; *Research Problems; Statistical Bias; Tests
              of Significance; Test Wiseness

ABSTRACT

Federal assistance for special educational programs 
makes necessary the regular study of evaluations of thousands of 
innovations in compensatory education, bilingual education, and 
reading programs. The results are reported to the President and to 
Congress. However, investigating organizations find only a few 
programs with adequate evidence and thousands with faulty evaluation 
designs. Some of the most common faults are discussed, with examples. 
There are other factors which lower hopes. If greater numbers with 
real evidence could be found, knowledge would increase even without 
an increase in the number of exemplary programs. (Author) 






Foibles and Fallacies in Educational Evaluation

A. O. H. Roberts
RMC Research Corporation



April 1976 






Introduction

Since the beginning of Federal assistance to compensatory education, the
recipients of funds have taken part in annual studies of the success of their
innovations; ostensibly to weed out the unsuccessful ones, to improve some,
and to demonstrate the worth of the more successful with a view to
dissemination. Also, one would hope, to justify the continuation of public
funds. Reports go to Congress and to the President from the U.S. Office of
Education and from the National Institute of Education. In turn USOE and NIE
fund professional organizations to undertake studies of evaluations of
compensatory (Title I), bilingual (Title VII), and reading programs
(Title VIII). These professional organizations cast their nets as wide as
they can and draw responses from many hundreds of programs in receipt of
funds under one or more of these Titles; and part of each response is, or
should be, an evaluation. About one third of these evaluations are written by
subcontractors or consultants, and the rest by specialists in school district
offices. By far the greater proportion of these evaluations are valueless as
demonstrations; notice that it is not that they prove the program to be
without value but that they provide no useful evidence one way or the other.
It has been this way from the beginning, and only in part because the
difficulties are great; but it is astonishing to see how pervasive,
persistent and elementary are some of the avoidable wrong practices.
Hawkridge, Chalupsky and Roberts (1968) examined over 1,000 program
evaluations, selected 98 for site visits and found only 21 that met the
criteria. Tallmadge (1974) screened about 2,000 looking for exemplary
projects and found 136, of which he subsequently had to reject all but six.
Bowers, Campeau and Roberts (1974), searching for exemplary reading programs,
could find only 26 out of an initial 1,520. These evaluations by consultants,
research organizations, and school district specialists represent a great
deal of work and much of it expensive and wasteful; worse still it robs at
least some programs of the opportunity for recognition. Lastly, wading
fruitlessly through literally thousands of pages in search of usable evidence
is at least frustrating for those who evaluate these evaluations. What
follows should be seen as a constructive attempt to increase the proportion
of useful evaluations from programs.
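For scale, the three screening studies just cited can be reduced to yield
rates. The sketch below does only that arithmetic in Python. The screened
counts for the first two studies are the rounded figures given in the text
("over 1,000" and "about 2,000"), so the percentages are approximate, and the
names used in the code are ours.

```python
# Yield of usable evaluations in the three screening studies cited above.
# The first two screened counts are rounded in the text, so these rates
# are approximate; each entry is (number screened, number finally accepted).
studies = {
    "Hawkridge, Chalupsky and Roberts (1968)": (1000, 21),
    "Tallmadge (1974)": (2000, 6),
    "Bowers, Campeau and Roberts (1974)": (1520, 26),
}


def yield_percent(screened, accepted):
    """Percentage of screened evaluations that met the criteria."""
    return 100.0 * accepted / screened


rates = {name: yield_percent(*counts) for name, counts in studies.items()}
for name, rate in rates.items():
    print(f"{name}: about {rate:.1f} percent")
```

On these figures no study found usable evidence in much more than two percent
of the evaluations it screened.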

There are two main headings in what follows. In the first are discussions
of some frequently found but avoidable errors, and in the second, some
examples of difficulties usually beyond the control of educators. Perhaps there
should have been a third, of practical difficulties and of sources of bias
whose explicit recognition is demanded, but for which the only actions are
reasonable allowance or approximate corrections, or at least discussion.
Here would be found errors resulting from regression to the mean, from
volunteering, from attrition or from loss of data.

Major Fallacious Approaches in the Evaluations 

Very few programs pass even the minimal requirements. This section
describes some of the inaccurate and inappropriate approaches that are found,
not just occasionally, but with considerable frequency. Those described here
appear so often that they can be considered to be serious trends in a field
where the background of varying approaches and the resulting controversy has
made sound evaluation more essential than ever.

Use of Criterion-Referenced Tests

Scores of evaluation reports appear with "criterion-referenced testing"
as the central, if not the only theme; with no comparison group, no attempt
to give meaning to the figures quoted and often only the scantiest description
of objectives. In this form it is difficult to see how testing can argue
success, although it is easy to understand its popularity; for several years
now, writers have been actively promoting the belief that if innovations do not
live up to expectations, it can only be because tests measured what was not
being taught, did not measure what was, did it the wrong way and with score
conversions which were inappropriate. For these criticisms, criterion-referenced
testing is the "perfect" answer:

• It specifically and intentionally precludes the very
comparisons that norm-referenced testing makes possible.

• Those with a vested interest in the success of the innovation
have the unchallengeable control of objectives and
curriculum, and even of the definitions used; and these
need have no relationships to existing systems.

• They alone determine what constitutes successful achievement
or "mastery" of objectives, frequently in terms of quite
arbitrarily chosen percentages. (We have seen these ranging
from 60 percent to 95 percent, even within a single study.)

• There can be no checks on the validity or reliability of
their tests, or on objectivity or consistency of scoring.

• Disappointing results can be blamed on teachers, definitions
of the instructional objectives, or standards set
too high.

• They can even give purely subjective standards a cloak
of respectability by converting results to tallies.

Given the same freedoms to make the rules of the game and name blank cards,
one need never lose at Poker or Bridge.

Here is an actual example which though somewhat extreme is not completely
atypical. The innovators of a program laid down their own objectives (not
given) and provided their own evaluative tools "to assess accomplishment for
each of the designated subject areas" (no examples or descriptions). They
dismissed the state-mandated standardized tests as being unsuitable, and
although these were given to students, they made no use of the data collected.
An objective was said to be "mastered" if 70 percent or more of the students
completed it. The same instruments were used for pretest and for posttest,
after which differences in proportions mastering the objectives were tested
with a correlated means t-test; this gave them two bites at the same cherry,
since if "mastery" was not attained (below 70 percent managing the objective)
then significant "gains" could perhaps be shown. For example, the six pupils
in a grade 1 improved their mean proportion of correct responses from 0.04
on the pretest for social studies to 0.09 on the posttest with a t value of
3.21, significant at better than the 1 percent level. In passing, this
program managed to produce some of the highest t-values ever seen: one, a
value of 36.57, achieved using a sample of only 12; another of 27.67 from a
sample of 28; and what must surely be an all-time record, 52.17, on 53 pupils.
(They modestly put this as significant at better than the 1 percent level.)
When only 67 percent of the objectives were completed, they "concluded
that the instrument and objectives must be reevaluated for grade level
appropriateness and content validity." In fact, a fair number fell below
the 70 percent level, but this was always laid at the door of design of
instrument or choice of objective, never the result of failure of treatment.






This is perhaps a kinder explanation than that in another case we
encountered, where an evaluator who was also the innovator blamed the poor
showing on unnecessarily strict standards set by teachers in deciding whether
the objectives had been achieved. He promised to reeducate those teachers for
the next year!

Although the aids are not new, the stress being put on specific r^ng 
of objectives, on counting pupils who attain each, and on checking pupils, 
objectives, and teaching methods wherever testing shows failure, are profitable 
uses of criterion-referenced testing. Used properly, this approach can raise
signals at any one or more points where attention is called for; it may be
that the objective needs to be reconsidered or defined, or that the standard
of judging is inappropriate, or that the treatment needs modification. That
in itself points clearly to the limitations of criterion-referenced testing— 
at least as it is being used: It is flexible enough at every level to be 
considered plastic: each school or district chooses its own objectives, which 
may have little to do with the basic intent of the funding, and may even be 
at variance with it on occasion; if too few achieve an objective their numbers 
can be increased by changing the standards, and we never encountered any 
attempt to demonstrate the objectivity, validity, or reliability of the scoring 
system. This statement appeared in one report: "Student participants
performed better on criterion-referenced tests than they did on standardized
or commercially prepared instruments. This performance lends encouragement
to the continued development of instruments for (such) education."

The strength of a chain is the strength of its weakest link—and that 
is no less true for a chain of logic. The use of these head-counts (per- 
centage of pupils passing objectives) for significance— or confidence testing 
with correlated means t-tests— gives no confidence whatever, if for example, 
one has no test of the objectivity of teachers' judgments, or of the appro- 
priateness of the specification for the criterion percentage. 
If now, to find corrections for these objections — 

• universal objectives are set (i.e., the same objectives 
are set for all schools) ; 

• the scoring system is made objective; 

• the standards set are tied to typical performances 
instead of being arbitrary (as, for example, "70 per- 
cent will pass 80 percent of the objectives") 

then comparisons become possible, but the test is then norm-referenced and
becomes subject to the criticisms to which we referred earlier.

Tests of Significance of Gains

Especially as an attempt to plug the holes left by the fashionable
criterion-referenced testing, one of the most ubiquitous experimental designs
being used to show the "worth" of educational innovation is the derivation
and statistical testing of gain scores (i.e., posttest minus pretest). The
statistical test applied is usually the correlated means t-test, although
sometimes the Wilcoxon Signed Ranks Test is used. When the gain scores result
from criterion-referenced tests, we can usually dismiss the demonstration.
But even results from recognised achievement tests cause problems more often
than not. When these scores are in the form of grade equivalents, it is simply
an exercise in futility and a red herring, at best; depending upon the grade
level, a gain of a mere three to five months over a year, for a sample of 17
pupils would be likely to be significant at the 0.1 percent level or better,
and even that could be just maturation, or practice effect, or both, and
independent of even trivial education. But when the scores being used are
standard scores, or worse still (as we frequently found) are the original
raw scores, it is not even possible to translate the gains into meaningful
terms at all.

In general, programs are obviously planned to achieve specified goals,
and not to produce research findings. Evaluation is secondary, and in
practice a significant proportion of programs subcontract this aspect out
to individual consultants or to one of several specialist organizations;
roughly one third of all programs have their evaluations done in this way.
When this is done late in the process, these consultants and organizations find
themselves in the role of "hired guns" defending the program's claims and
funding, but in terrain not of their choosing. If called upon to do so in
time, these evaluators could in many cases plan to collect more useful data;
but when simply given raw data already collected, they will be under some
pressure to present summaries in the most favorable light. The remedy of
course lies in making greater efforts to ensure that minimum standards for
evaluation are met before funding the innovation.

One organization in particular is being employed to do evaluations by
school districts from all over the nation with innovations in reading,
bilingual education, and other areas. In general, this organization does good
work, but it has made a fetish of significance testing of gain scores,
devoting a sizable proportion of its research effort to it. This produces
large numbers of "significances" which no doubt please those innovators
without statistical sophistication, but often cloud some other, and more
dubious, results. In one case, the report contained no less than 204
correlated means t-tests, mostly of raw scores. Predictably, more than half of
these were significant at the 0.1 percent level; only 41 were "not significant"
in spite of the fact that 60 of the samples contained 14 or fewer students.
Every class was tested separately and again as part of the grade level. One
sample of 17 produced a t of over 25, for which the evaluator, with a modesty
which belied his zeal, claimed a significance of "better than 0.1 percent";
in fact, it had the astronomical value of well beyond one-in-a-googol (i.e.,
one-in-10^100)! Almost exactly half of this evaluation report of over 200
pages was devoted to this type of reasoning, and most of that in tabulation and
bar graphs.

One does not need to be a mathematical statistician or an educational 
philosopher to see the fallacy of this approach, but its implications and im- 
pacts should be considered carefully. The rough mathematical equation 
following should be regarded just as a foundation for reasoning. 

\[
t = \frac{\sqrt{\text{Sample size} - 1} \times (\text{Difference between pre- and posttest scores})}
         {\sqrt{(\text{Sum of variances of the two tests}) - (\text{Twice the geometric mean of the variances} \times \text{correlation of the two tests})}}
\]

or, as a very close approximation, but much more concisely,

\[
t = \frac{\text{Gain}}{\text{Average Standard Deviation}} \times \sqrt{\frac{\text{Sample size} - 1}{2\,(1 - \text{Correlation})}}
\]
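One way to see how the concise form follows from the longer one is sketched
below, in a notation that is ours rather than the original's: n is the sample
size, G the mean gain, s_1 and s_2 the standard deviations of pretest and
posttest, and r their correlation. The only assumption needed is that the two
standard deviations are nearly equal, so that both can be replaced by an
average value s. The product s_1 s_2 is the geometric mean of the two
variances.

```latex
\begin{align*}
t &= \frac{\sqrt{n-1}\; G}{\sqrt{s_1^2 + s_2^2 - 2\, r\, s_1 s_2}} \\
  &\approx \frac{\sqrt{n-1}\; G}{\sqrt{2 s^2 (1-r)}}
     \qquad (s_1 \approx s_2 \approx s) \\
  &= \frac{G}{s}\,\sqrt{\frac{n-1}{2(1-r)}}
\end{align*}
```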

Now, it should be easy to see that if only the correlation changes, the
value t is smallest when there is no correlation at all between pre- and
posttest scores;* in which case just what is the rationale for subtracting the
one from the other? Subtracting horses from sheep?

* A colleague points out that negative correlations would make t still smaller.
Has anyone seen a negative pretest-posttest correlation lately?

On the other hand, however, if the two tests are thoroughly reliable,
and are measuring just the dimension of interest, the correlation will be
high; and as it gets closer to unity, the t value goes up like a balloon.
For example, if the correlation between the tests is 0.84 (not a dramatic
figure) and there were 33 pupils, we will be multiplying the ratio of gain-to-
average standard deviation by a factor of 10. Thus, using grade equivalents for
illustration (for which a fairly typical average standard deviation is about
eight to ten months), a gain of four months over a whole year would produce
a t value of between 4 and 5, significant at around the 0.1 percent one-tailed
level.

Now, particularly in the lower grades, a combination of a year's
maturation, practice effect of the first testing, with disadvantaged pupils in
an ordinary traditional classroom has typically been increasing vocabulary,
reading skills, and even basic mathematics by a grade equivalent growth of
about 7 months. For a typical class of 25 pupils, and a pretest-posttest
correlation of .87 (an actual figure) we would get a t value of about 6.8,
with a one-tailed significance of better than the 0.000025 percent level!
To interpret this as an indication of dramatic success would be ridiculous.
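The multipliers quoted in the last two paragraphs can be checked with a few
lines of Python. This is only a sketch of the approximate formula given
earlier. The function names are ours, and the ten-month standard deviation
used for the class of 25 is an assumption taken from the upper end of the
"eight to ten months" range.

```python
import math


def t_multiplier(n, r):
    """Factor sqrt((n - 1) / (2 * (1 - r))) from the approximate formula."""
    return math.sqrt((n - 1) / (2.0 * (1.0 - r)))


def approx_t(gain, avg_sd, n, r):
    """Approximate correlated-means t; gain and avg_sd share the same units."""
    return (gain / avg_sd) * t_multiplier(n, r)


# 33 pupils with a pretest-posttest correlation of 0.84.
factor_33 = t_multiplier(33, 0.84)

# A four-month gain with an average SD of ten months, then of eight months.
t_low = approx_t(4, 10, 33, 0.84)
t_high = approx_t(4, 8, 33, 0.84)

# Class of 25, correlation 0.87, seven-month gain.
# The SD of ten months is an assumption (upper end of the quoted range).
t_class = approx_t(7, 10, 25, 0.87)

print(round(factor_33, 2), round(t_low, 2), round(t_high, 2), round(t_class, 2))
```

With these inputs the factor for 33 pupils comes out at 10 and the four-month
gain gives a t between 4 and 5, as stated. The class of 25 gives roughly 6.7
under the assumed ten-month standard deviation, close to the 6.8 quoted.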

Few educational innovators are also trained researchers; in fact, only
the largest districts have departments that can deal with statistical analysis.
For many teachers who already are inclined to point to "the happiness of the
children" as a demonstration of success, this kind of statistical glitter is
misrepresentation which they will find difficult to resist. For those who
are evaluating the report, it is clutter with a nuisance value at best, and
otherwise a source of additional computational demands in an attempt to
derive meaningful information. For example, one can guess at the correlation
between the two tests, then multiply the t value, if given (or the estimated
value, if not), by $\sqrt{2(1-r)/(n-1)}$, where r is the guessed correlation
and n the sample size. The result is an estimate of Gain / Average SD,
which is a measure of important educational change if the tests are
standardized.
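The back-calculation just described is easy to mechanize. The Python sketch
below inverts the approximate formula for t given earlier. The function name
is ours, the correlation has to be guessed as the text says, and the t value
and class size are the class-of-25 figures quoted above.

```python
import math


def gain_in_sd_units(t, n, r_guess):
    """Estimate gain / average SD from a reported t, sample size n and a guessed r."""
    return t * math.sqrt(2.0 * (1.0 - r_guess) / (n - 1))


# Round trip on the class of 25: t of about 6.8 with a correlation of 0.87.
effect = gain_in_sd_units(6.8, 25, 0.87)
print(f"gain is about {effect:.2f} average standard deviations")

# The estimate depends on the guessed correlation, so it is worth trying a range.
for r_guess in (0.80, 0.87, 0.90):
    print(r_guess, round(gain_in_sd_units(6.8, 25, r_guess), 2))
```

The round trip returns about 0.7 of a standard deviation, which is the
seven-month gain again if the standard deviation is near ten months. Guessing
a lower correlation gives a larger estimate and a higher one a smaller
estimate, so a single guess should not be leaned on too heavily.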

The crucial flaw in this approach is not, of course, that there is any- 
thing wrong with the statistical procedure itself or even that it does not 
test the hypothesis proposed; it is that the hypothesis is a trivial one, not
worthy of testing. What exactly is the null hypothesis implied? It is that: 
"No increase of learning has occurred over the period." 

Some learning is taking place even in the absence of teaching; maturation,
what children learn from one another, and even what they learned of
test-taking itself can all be expected to create change. The correct null
hypothesis then should be:

"Change in educational conditions has brought no
change in the rate of increase of learning."

This hypothesis is tested when, with appropriate care, 

• a control or comparison group is used; or

• comparison is made with the rate of increase in the
same group before change in educational conditions; or

• comparison is made with increases in classes previous
to educational change; or

• some reasonable basis exists for establishing an
expectation of increase in the absence of educational
change.

The phrase "with appropriate care" above is vital. 
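As a minimal sketch of the first of these options, the Python below compares
the gains of a treatment group with those of a comparison group, so that the
hypothesis tested concerns the difference in gains and not the mere existence
of a gain. Welch's unequal-variance t statistic is our choice for the sketch,
not a procedure named in the evaluations reviewed. The scores are invented
for illustration, and nothing here deals with whether the two groups are
comparable in the first place.

```python
import math
from statistics import mean, variance


def welch_t(gains_a, gains_b):
    """Welch t statistic and approximate degrees of freedom for two sets of gains."""
    na, nb = len(gains_a), len(gains_b)
    # Squared standard errors of the two mean gains.
    va, vb = variance(gains_a) / na, variance(gains_b) / nb
    t = (mean(gains_a) - mean(gains_b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


# Hypothetical grade-equivalent gains in months (invented for illustration).
treatment = [9, 7, 8, 10, 6, 9, 8, 7, 11, 8]
comparison = [7, 6, 8, 7, 5, 9, 6, 7, 8, 7]

t, df = welch_t(treatment, comparison)
print(f"treatment mean gain  = {mean(treatment):.1f} months")
print(f"comparison mean gain = {mean(comparison):.1f} months")
print(f"t = {t:.2f} on about {df:.0f} degrees of freedom")
```

In this invented case both groups gain (8.3 and 7.0 months on average), so a
test of "no increase" would come out significant for either group alone. Only
the difference between the two mean gains bears on the innovation.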

There are occasions when the hypothesis, "No change has occurred," is
appropriate, although it is probably safe to say that this is never so when
considering educational increase. The exceptions are when no change can
reasonably be expected without intervention. For example, it is reasonable to
expect no significant change in affective measures unless changes have been
made in the environment. Examples are self-concept, and teacher and parent
attitudes. It is noteworthy that t-tests in these areas are frequently
non-significant. The same organization mentioned earlier repeatedly did such
tests; here are some figures from the three separate evaluations (different
districts) in which affective measures occurred.

• Of 26 educational changes, only one was not significant;
but of five measures of affective change, all five failed.

• Of 66 educational changes, only four were not significant;
but of 75 affective measures, 44 failed the test.

• Of 90 educational changes, 18 were not significant;
but of 30 affective tests, 20 were non-significant.
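The contrast in these three sets of figures is easier to see as percentages.
The short Python sketch below only converts the counts quoted above. The
label "achievement" for the educational changes is ours.

```python
# (total tests, non-significant tests) for each kind of measure in the
# three evaluations, as quoted in the list above.
evaluations = [
    {"achievement": (26, 1), "affective": (5, 5)},
    {"achievement": (66, 4), "affective": (75, 44)},
    {"achievement": (90, 18), "affective": (30, 20)},
]


def percent_nonsignificant(total, nonsignificant):
    """Percentage of tests that failed to reach significance."""
    return 100.0 * nonsignificant / total


summary = [
    {kind: round(percent_nonsignificant(*counts), 1) for kind, counts in ev.items()}
    for ev in evaluations
]
for number, row in enumerate(summary, start=1):
    print(f"Evaluation {number}: achievement {row['achievement']}%, "
          f"affective {row['affective']}% non-significant")
```

In each evaluation at most a fifth of the achievement tests fail to reach
significance, while well over half of the affective tests do.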

Misuse of Analysis of Covariance

A more sophisticated version of t-testing is analysis of variance
together with its offshoot, analysis of covariance. It is found much less
frequently since it demands more expertise. Nevertheless, it has less value
than some of its protagonists would believe, being for the most part a means
of measuring confidence in observed change, rather than dimension of change.
When applied to pretest and posttest scores, the same limitations apply as
for testing of gains. Analysis of covariance, in particular, is occasionally
found misused.

This procedure is sometimes used to make "adjustments" for starting
differences between treatment and comparison groups on pretests. Theoretically,
this makes it possible to compare gains of dissimilar groups. When these
starting differences are themselves non-significant, such adjustments do little
harm, have at least a superficial logic, but serve little purpose. But when
the differences are large, adjustments cannot be supported, most especially
when the control group has the lower pretest performance.
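What such an adjustment does to the numbers can be shown with a small
sketch. The Python below uses the usual textbook form of the covariance
adjustment (posttest mean minus the pooled within-group slope times the
distance of the group's pretest mean from the grand pretest mean). The scores
are invented, the function names are ours, and no claim is made that any
particular evaluation computed it in exactly this way.

```python
from statistics import mean


def pooled_within_slope(groups):
    """Pooled within-group regression slope of posttest on pretest."""
    sxy = sxx = 0.0
    for pre, post in groups:
        mx, my = mean(pre), mean(post)
        sxy += sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx += sum((x - mx) ** 2 for x in pre)
    return sxy / sxx


def adjusted_means(groups):
    """Posttest means adjusted to the grand pretest mean."""
    slope = pooled_within_slope(groups)
    grand_pre = mean([x for pre, _ in groups for x in pre])
    return [mean(post) - slope * (mean(pre) - grand_pre) for pre, post in groups]


# Hypothetical (pretest, posttest) scores; the comparison group starts
# far below the treatment group, with no overlap on the pretest.
treatment = ([50, 52, 54, 56, 58], [56, 57, 60, 61, 64])
comparison = ([30, 32, 34, 36, 38], [35, 37, 38, 41, 42])

groups = [treatment, comparison]
adjusted = adjusted_means(groups)
print("pooled slope:", round(pooled_within_slope(groups), 2))
print("adjusted means:", [round(m, 1) for m in adjusted])
```

With these invented scores the pooled slope is 0.95 and the adjusted means
come out at 50.1 and 48.1. Each group mean has been moved about nine and a
half points, to a pretest level of 44 at which neither group has a single
pupil.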

Analysis of variance was the precursor and foundation for analysis of
covariance. It was devised by Sir Ronald A. Fisher early in this century.
He worked primarily in areas of anthropology and agriculture, where
measurements were almost invariably the most refined type: ratio measurement.
For this type there is a true zero, with considerable confidence that equal
intervals at widely separated points can be compared; such measures are lengths,
weights, volumes, values, and counts. Educational measures are not of this
type; they are sometimes termed interval measures (the next lower type in terms
of information provided) but even this may be deceptive, since it implies
equality of intervals. There is no direct way that the equality of these
intervals can even be tested; and only the most tenuous way in which rough
equivalence can be inferred (through assumption of normal distribution, for
example). Adjustment through analysis of covariance extrapolates the scale
for the lower group upwards, and that for the superior group downwards. Even
from two widely separated regions of scoring from a single test, but still
more so from different levels of tests, there can be no assurance that the two
groups are being measured on the same scale, or even for that matter on
precisely the same continuum. There are other more cogent objections to this
process, but they call for lengthy statistical argument. Those who can benefit
from more sophisticated discussions of the problems involved should refer to
the articles by Huck and McLean (1975), and Porter and Chibucos (1975). We
will rest our case, but in agreement with Tallmadge and Horst (1974),* we
should reject conclusions based only upon such evidence.

This should not be seen as a criticism of the analytic process, but of
one use to which it is frequently put. In general, though, the indications
deriving from analysis of variance seem more useful as starting checks than
as arguments for success. They certainly should not be used for major
sculpturing of unsuitable data. The Encyclopaedia Britannica has a section
dealing with the work of Sir R. A. Fisher in general and analysis of variance
in particular. It concludes with "Unfortunately the finest statistical
treatment will not compensate for poorly selected units of observation or
measurement."

* Tallmadge, G. K., & Horst, D. P. A procedural guide for validating achievement
gains in educational projects. (Revised) Los Altos, California: RMC
Research Corporation, December 1974.

Practice Effect 

An issue which has received rather scant and cursory attention, but
which has the potential for a major upset of many reports on special
educational programs, is that of the effects of practice alone on retest scores.
In the literature, there are some brief caveats about the increases to be
expected from "test sophistication" or from "test interactions" but with few
estimates of the size of the effect or its duration. See, for example,
Campbell and Stanley in Handbook of Research on Teaching (N. L. Gage, Ed., 1965,
p. 75).



In the field, there is a strong tendency for innovators to discount the
possibility of bias as a result of practice effect, except when it can be
turned to profit, or when there is a fear that the bias can tell against the
innovation. The arguments usually given for ignoring the threat are:

• A time interval of seven months or more between testings
(the most frequent lapse of time) is enough to wipe out
gains resulting from familiarity with tests or with
individual items.

• Use of a parallel test nullifies practice effect.

• The effects are allowed for, when a control group
is used.

• Modern children are so familiar with test situations
that they have already reached a plateau where further
testing can add nothing more.

• The effects are small enough to be ignored.

• For practical purposes there will be no further increase
from practice alone, from the third testing onward; or,
another form of the same argument:

• By intensive coaching spurious benefits can be exhausted
before the first testing, so that gains are
uncontaminated.

How valid are these arguments? The last sounds plausible enough, except
when capital is to be made of the absolute measure of achievement from such
normed conversions as grade equivalents, percentiles, stanines, or standard
scores; then we would have to remember that the norming sample did not receive
the benefits of such coaching. Even this is not enough. Bright students can
benefit more than duller ones, from such advice as: "Eliminate one or more
obviously wrong alternatives and then, if still in doubt, deliberately guess
from amongst the remaining choices." As a result, subgroups would be
identified for which the treatment appeared to be a success.

But the most cogent counter-argument is this: Here, as so often happens 
when dealing with living things, biases tend to be positively correlated 
and not random; and therefore additive, not cancelling. Who are the ones 
aimed at by the various compensatory education acts? They are those 

• from poorer environments 

• in ill-equipped schools 

• with severe reading disabilities 

• with more dependence upon drill and 
less upon understanding 

• from non-English speaking homes, 
particularly immigrants 

• who are very young, and at the threshold 
of their educational experience. 

And who are the ones with the least contact with tests, with least famili- 
arity with test-taking conditions and with the content of tests, and most 
subject to the debilitating effects of anxiety and failure? Who have most to 
gain from experience, from practice and specific instruction? Precisely the
same categories of students.

On a priori grounds we would expect there to be effects, at least for
some varieties of tests, more especially in cognitive areas. Over the years
these expectations have been dealt with under such headings as "Practice
Effect," "Test Sophistication," "Test Interaction," and (as a result of Huff's
book (1961), Score: The Strategy of Taking Tests) "Test Wiseness," a concept
which Millman, Bishop and Ebel (1965) discussed in some depth and which in
turn sparked attention from several others about that time. But although there
has been a great deal of lip-service and some experimentation, both reasoning
and demonstrations have been honored more in the breach than in the observance,
in recent years on a very large scale indeed. If repeated testing alone can
make significant increases in later scores, then a substantial proportion of
compensatory education demonstrations of all sorts (Title I, Title VII, and
Title VIII included) [...] ten years fall under a large cloud of suspicion;
perhaps some demonstrations can even be [...] to this, as will be seen.
[...] proportion of all evaluations of compensatory education innovations
depend entirely upon a show of gains over two or more testings, with many
such gains being statistically significant but practically minor, of the
order of a third of a standard deviation or less.

A preliminary, and rather quick literature search has turned up little
more than discussion, and a few demonstrations, mostly on small samples. In
one study, Ernest Lewis (1975) reported significant rises on IQ tests for
860 grade 6 students, more particularly on Verbal IQ, but without indicating
size of effect. Welch and Walberg (1970) found no significant effects when
2,200 students from secondary schools were tested on such things as physics
achievement, understanding science, process knowledge, etc. These students
had an average IQ of 116. Lucas (1972) considered that their conclusions
could be misleading, and found significant effects of up to one standard
deviation using three samples of grade 12 students, each of about 50, and using
the Watson-Glaser Critical Thinking Appraisal. Callenbach (1975), using 48
second grade students, found significant effects; he says "Although measurement
experts...popular writers...and researchers...have described and analyzed
test-wiseness (TW) and the effects of instruction in TW upon test performance,
little of the TW literature and research has focused on the primary grades
where, according to Joslin and others there has been a growing dependency
upon group administered standardized tests...".

One interesting experiment is reported by Verster (1974) although its
findings have limited applicability to our problems. A sample of 2,547 adults
in an industrial setting with little test experience were given a battery of
cognitive tests of ability. The group was then randomly divided into four
subgroups. Group A were tested with the same battery four more times at three-
month intervals; Group B were tested after an interval of six months, and then
twice more at three-month intervals; Group C were tested after a nine-month
interval, and again at the end of a year; and Group D had their first and only
posttest at the end of the year. The graphs below are typical learning curves.




[Figure 1. Example of Practice Effects. Horizontal axis: Months (0, 3, 6, 9, 12).]

Notice that the first retest for all four groups produced almost identical
gains (DX) irrespective of the time interval. This gain was roughly one
third of a standard deviation; and even a lapse of a year had very little
effect. The gain for the second retest is represented by the distance CD, and
is again virtually constant for the three groups involved; it is about a
quarter of a standard deviation. The gains BC and AB are smaller for third and
fourth retests respectively, but for this sample, the total gain as the result
of four retests is about one whole standard deviation. Even if one were to
regard the first increase, DX, as being due to learning, the gain AD, about
two thirds of a standard deviation, must be attributed to increase in
test-taking skills alone.
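
The arithmetic behind this attribution can be checked with a short sketch. The
fractions are the approximate values quoted in the text (one third, one
quarter, and one whole standard deviation); the variable names are
illustrative only and do not come from Verster's report.

```python
from fractions import Fraction

# Approximate gains quoted in the text, in standard-deviation units.
first_retest_gain = Fraction(1, 3)   # DX
second_retest_gain = Fraction(1, 4)  # CD
total_gain = Fraction(1, 1)          # all four retests together

# Gain left over if DX is conceded to genuine learning (AD).
practice_only_gain = total_gain - first_retest_gain

# What remains for the third and fourth retests combined (BC + AB).
later_retests_gain = practice_only_gain - second_retest_gain

print(practice_only_gain)   # 2/3
print(later_retests_gain)   # 5/12
```

The remainder of 5/12 spread over two retests averages less than a quarter of
a standard deviation each, which agrees with the statement that BC and AB are
smaller than CD.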




If practice effect alone can cause important changes, then where gains 
for a treatment group only are being studied, some part at least would have 
to be discounted. The effect would be largest in the lower grades where there 
had been little test-taking; it would also be largest for less sophisticated 
students from countries with less emphasis on testing, for example perhaps, 
Portuguese immigrants from the Azores or Spanish-speaking Puerto Ricans. 

Even where there are comparison groups, trouble can arise. For example,
one program tested its students with two versions of the test for pretesting;
for the posttest again two versions were given. Furthermore, the same students
repeated this process in each subsequent grade so that by grade 4 they could
have been tested more than sixteen times. The comparison group, however, was
drawn randomly each year with, in all probability, a good deal lower
average number of testings; at the least it received only one version of each
test.

In another large program, a change occurred after the first series of
tests. It was decided to equalize the effects of test sophistication for all
its program students, on the grounds that some of them had had less exposure
than others. They therefore called their teachers together, gave them a
thorough briefing on the tests to be used later, instructed them to draw up
their own parallel forms and to use these with standard instructions to give
their pupils practice in doing these tests. Of course, their argument is
correct that sufficient repetition would lift all students onto the plateau;
but in converting retest scores to grade equivalents they are ignoring the
fact that norms for that test were certainly not derived from such a
well-trained population, so that gains from one year to the next were doubly
spurious.

Lastly, it should not be thought that consideration of practice effect 
can only detract from positive findings; if the control group has had greater 
exposure to tests (as sometimes happens for precisely the same reasons that 
educational compensation was sought in the first place) then a straightforward 
comparison of mean test scores may underestimate the treatment effects. 

Effects of Revisions of Test Norms 

We have encountered this problem in more than one study. One program
evaluator drew attention to the use of different standards in his report, but 
unless this is done it can easily escape being noticed. 

Over the years, test publishers have sometimes found it necessary to
revise their norm tables. This has recently happened to the Stanford
Achievement Tests amongst others. There appears to be a substantial lowering of
standards; the same raw score now qualifies for higher grade equivalents, more
particularly at the upper grades where the difference can be as much as a full
grade or more higher than on the older norms. Modu and Stern (1975) found
changes in the Scholastic Aptitude Tests of about a third of a standard deviation
over the years 1963-1975. Whatever the reason for this, the use of the older
norms at first testing or in lower grades, followed by conversions on new norms
at retesting or in higher grades, can make the program appear to be a colossal
success. When the tests themselves have been revised the same drop in standards
undoubtedly exists, but is then even more difficult to detect or to compensate
for.
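
A small sketch may make the mechanism concrete. The two norm tables are
invented purely for illustration (they are not Stanford Achievement Test
norms); each maps a raw score to a grade equivalent, and the newer table is
assumed to award higher grade equivalents for the same raw score.

```python
# Hypothetical norm tables: raw score -> grade equivalent.
# These numbers are invented for illustration, not taken from any real test.
OLD_NORMS = {30: 3.0, 40: 4.0, 50: 5.0}
NEW_NORMS = {30: 3.4, 40: 4.6, 50: 6.0}


def apparent_gain(pre_raw, post_raw, pre_norms, post_norms):
    """Grade-equivalent gain when pretest and posttest use the given tables."""
    return post_norms[post_raw] - pre_norms[pre_raw]


# A pupil who moves from a raw score of 30 to 40 over one school year.
consistent = apparent_gain(30, 40, OLD_NORMS, OLD_NORMS)
mixed = apparent_gain(30, 40, OLD_NORMS, NEW_NORMS)

print(f"same norms both times: {consistent:.1f} grades")
print(f"old norms, then new:   {mixed:.1f} grades")
print(f"spurious part:         {mixed - consistent:.1f} grades")
```

Under these assumed tables the same raw-score progress reads as a larger
grade-equivalent gain when the posttest is converted on the newer norms, and
the excess owes nothing to the program.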

"Post hoc, ergo propter hoc"

Ideally, of course, benefits from programs should be attributable solely
to the effects of use of a selected theme, and increased expenditures should
be warranted by this additional theme alone.

However, there can be little doubt that often the additional funding has
been used for considerable improvement in conditions for the selected group
only, making it a moot point whether the use of a selected theme made any
positive contribution at all. It is conceivable that these improved
conditions could have cloaked deterioration of performance in the select group,
and could have produced more general benefits without the new element. It is
not uncommon to find classrooms in a school with a teacher and an aide to
cope with about 20 pupils, with tape recorders, slide projectors, Language
Masters and shelves of new books, while next door a single teacher in a very
ordinary classroom deals with 30 or more pupils. The problem these pupils in
the ordinary classroom now have is that they did not have a problem to start
with.

Of course it is seldom possible to identify the particular components of
a treatment which have led to success; but it is a poor demonstration, when
features well outside of the main theme are incorporated into the treatment,
but kept away from the comparison group. Here, given the same additional
facilities, the comparison group could conceivably outstrip the experimental
group in performance.

An error that pops up now and then, in spite of repeated warnings, is
that of inferring causal relationships wherever association is found. For
example, volunteer students are taught a foreign language, and it is found that
their average achievement in the home language is higher than that of
non-volunteering students, from which it is "concluded" that learning a foreign
language benefits performance in the home language, overlooking several more
likely and more plausible explanations. Volunteers are seldom typical
individuals.

Constraints on Implementation and Evaluation
Conflicts with Educational Ideals 

The majority of evaluations in most studies seem to be flawed beyond
redemption. It might be supposed, especially in view of the criticisms we have
made of both experimental designs and of the statistical procedures used, that
the evaluators lacked competence. It is, of course, true that the experimental
design was usually that of the educators involved, and the statistical
procedures were often selected by them. It is also true that research is a
specialist occupation, and that there is no good reason whatever why teachers
and educational administrators should be trained and experienced in both fields.
But it would be wrong to assume that problems of evaluation would largely
disappear with better training or more use of research specialists. Not only
are many difficulties unavoidable, but it should be clearly recognized that
modern educational practice and aims are often in direct opposition to the
needs of sound research; that research is possible only by seeking and
capitalizing on observed differences, while modern education ideologically, if
not ideally, considers differences a call to immediate action.

For example, while random allocation of a sample to experimental and
control groups is a powerful statistical device, it is virtually impossible in
most educational situations; on the contrary, placement in the treatment group
is done precisely because there is a need to eliminate a difference between
groups. Under threat even of court action, it is difficult to consider
withholding cases for comparison, since if the treatment was effective, this
would produce differences instead of removing them.

Conflicts with Laws and Regulations

Most education seeks to minimize differences between performances of
individuals by raising the lower. There is, of course, an alternative philosophy;
the goal could be, for each individual, to maximize the differences between
his consecutive performances. However, we have not often encountered such an
objective; almost all have sought to identify and treat groups whose performance
is below average. In at least one case, this purpose came into conflict with
what the court considered to be the objectives of desegregation. Thus, not
all failures must be laid at the door of poor design. The following case was
not unique, and is an example of problems a program may have to face from
federal, state, and court jurisdictions.

Their experimental design was as good as normal educational restraints
permitted with commendable use of refined statistical procedures following,
with frank interpretation of results, and with less special pleading than is
common in this area. They included a control group in their design; checked
on the initial comparability of control and experimental groups; recorded
differences in exposure to their treatment; showed the effects on various
subjects separately. They stated their hypotheses before analyzing their
results. They avoided the pitfalls of multiple t-tests by using a more
elaborate analysis of variance first. In the end, evidence for the success of
their program was not overwhelming, but patently honest and entirely credible.
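
The pitfall of multiple t-tests can be illustrated numerically. The sketch
below assumes, for simplicity, that the tests are independent and that each is
run at the same significance level; pairwise t-tests on shared groups are not
strictly independent, so the figure is only indicative. The function names and
the choice of five groups are illustrative assumptions.

```python
def pairwise_comparisons(groups):
    """Number of distinct pairs that can be formed among `groups` groups."""
    return groups * (groups - 1) // 2


def familywise_error(alpha, tests):
    """Chance of at least one false positive in `tests` independent tests."""
    return 1 - (1 - alpha) ** tests


k = pairwise_comparisons(5)  # 10 pairs among 5 groups
print(k, round(familywise_error(0.05, k), 3))  # 10 0.401
```

With ten comparisons at the 0.05 level, the chance of at least one spurious
"significant" difference is about 40 percent under these assumptions, which is
why an overall analysis of variance is run first.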

But then they were subjected to a series of interventions beyond their
control. Over a five year period, they had at least six regional consultants
with concomitant variations in interpretation of guidelines. A change in
guidelines forced them to change from a planned horizontal expansion of their
program to vertical expansion; they lost their control groups; they changed
patterns of bussing, but found reduced contact between the two main groups of
students. Next, they lost a desegregation suit to the Office of Civil Rights
which forced them to close one school, to redistribute the students for whom
the special treatment had been devised, and to reassign teaching staff.

They tried to readjust but then were subjected to a drastic cut in staff.
They compensated by placing more emphasis on development of materials and
in-service training, producing 2? specially trained teachers, all but seven of
them paid from local funds, and lost 16 of them to wealthier districts
when state legislation was enacted forcing the spread of the innovation.
To cap it all, they then received a mandate to expand their program from
grade 4 to grade 6.

This is not the only case in which special classes were judged to be in
violation of desegregation guidelines. The effect of these decisions is
obvious. Either the main thrust of the program must be considerably blunted,
or else all students must be compelled to follow the same program, even those
that do not need or want it. A court decision can alter the very nature of
a program.






Curriculum Overload 

Some innovational programs add appreciably to the work load of teachers
and students. If the ordinary curricula already fill the time available,
something will have to give. In only one case did we find evaluators
vigilant enough to check progress in other subject areas. They found that less
than one half of the set curricula in each of science and social studies
had been completed in the year, and quite frankly attributed this to the
effects of the increased work load.

Uncontrollable Sampling Biases

Ethical considerations, if not indeed legal ones, force two limitations
upon educationalists: They cannot easily deny students access to a program
which is manifestly intended to confer some educational advantage; and they
cannot easily override parental preferences even when they believe them
misguided. Educationally this is perhaps often of not much consequence since in
the long run many alternative systems lead to alternative goals of equal merit.
It is quite a different matter when scientific demonstration hinges upon such
decisions; then several undesirable interferences are probable, including
significant sampling biases.

For example, parents of some pupils will press, with a variety of
motivations, for inclusion of their offspring in programs. Even when these
pupils' results are considered separately, the constitution of comparison
groups is almost certain to be compromised and in a way which will make the
program appear better than it is. On occasion we suspected that the results
had not been partitioned, and that would enhance the program's showing still
further. Whatever the funding intent, some programs had considerable
proportions of non-target pupils, sometimes as a result of active encouragement
by the innovators. Even with target pupils, biases can, and demonstrably do,
occur as will be shown.

Two opposing considerations affect decisions by volunteering parents.
Some parents seem understandably anxious not to interfere with a satisfactory
progression through school, by changing horses in midstream. Thus especially
at the start of a new program, those whose children have already acquired some
skills prefer not to switch, while parents of children who are worried by a
general lack of achievement see the new program as a new hope. This creates
sampling differences which show results to the detriment of the program.






On the other hand, when the program is firmly established and probably
with an enriched environment and increased staff, the parents of the higher
achievers seize the opportunity to transfer their children into the program
and thus have the new experience as well. This kind of sampling bias produces
data which show the program to advantage. We have encountered both trends in
a single program at different stages of its development.
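
Both trends can be reproduced with a toy example in which the program has no
effect at all. The scores and the two enrollment rules below are assumptions
made purely for illustration; they are not data from any program reviewed.

```python
from statistics import mean

# Hypothetical end-of-year scores for ten pupils; the program is assumed
# to add nothing, so a pupil's score is the same in or out of the program.
scores = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]


def naive_effect(enrolled, everyone):
    """Program mean minus comparison mean, ignoring how pupils were chosen."""
    comparison = [s for s in everyone if s not in enrolled]
    return mean(enrolled) - mean(comparison)


# Early stage: parents of the lowest achievers opt in.
early = naive_effect([s for s in scores if s < 50], scores)
# Established stage: parents of the highest achievers transfer in.
late = naive_effect([s for s in scores if s > 65], scores)

print(early, late)
```

With a true effect of zero, the naive difference is negative under the first
rule and positive under the second, which mirrors the two stages described
above.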

Conclusions 

From the viewpoint of the research analyst, there seems to be much room for
improvement in communication between the administrative bodies of the funding
offices, the school administrations, and the evaluation agencies. Nothing
is going to produce substantial numbers of successful innovations; at this
stage of educational development we can reasonably hope for few only, and
only modest advances. However it would be a real advance if a substantial
reduction could be made in the number of programs being rejected for lack of
evidence, even if this meant an increase in the number disqualified by
contrary evidence; this would at least increase the number to which serious
consideration could be given.






References



Bowers, J. B., Campeau, P. L., & Roberts, A. O. H. Identifying, validating,
and multi-media packaging of successful reading programs. Final report.
Palo Alto, California: American Institutes for Research, December 1974.
(AIR-41200-12/74-FR).

Callenbach, C. The effects of instruction and practice in content-independent
test-taking techniques upon the standardized reading test scores of
selected second grade students. Journal of Educational Measurement,
Spring 1973, 10(1), 25-30.

Gage, N. L. (Ed.) Handbook of research on teaching. Chicago: Rand McNally, 1963.

Hawkridge, D. G., Chalupsky, A. B., & Roberts, A. O. H. A study of selected
exemplary programs for the education of disadvantaged children. Parts
I and II. Palo Alto, California: American Institutes for Research,
September 1968.

Huck, S. W., & McLean, R. A. Using a repeated measures ANOVA to analyze the
data from a pretest-posttest design: A potentially confusing task.
Psychological Bulletin, July 1975, 82(4), 511-518.

Huff, D. Score: The strategy of taking tests. New York: Appleton-Century-
Crofts, 1971.

Lewis, E. Repeated testing: Interpreting the results. Proceedings of the
AERA Annual Meeting. Washington 1975, (summary).

Lucas, A. M. Inflated posttest scores seven months after pre-test. Science
Education, 1972, 56(3), 381-387.

Millman, J., Bishop, C. H., & Ebel, R. An analysis of test-wiseness.
Educational and Psychological Measurement, 1965, 25(3), 707-726.

Modu, C. C., & Stern, J. The stability of the SAT score scale. Research
Bulletin, College Entrance Examination Board Research and Development
Reports, RB-75-9. Princeton, N. J.: Educational Testing Service,
April 1975.

Porter, A. C., & Chibucos, T. R. Common problems of design and analysis in
evaluative research. Sociological Methods and Research, Feb. 1975,
3(3), 235-257.

Tallmadge, G. K. The development of project information packages for effective
approaches in compensatory education. Los Altos, California: RMC
Research Corporation, October 1974 (Technical Report No. UR-254).

Tallmadge, G. K., & Horst, D. P. A procedural guide for validating achievement
gains in educational projects (revised). Los Altos, California: RMC
Research Corporation, December 1974.






Verster, M. A. The effects of mining experience and multiple test exposure
on performance on the Classification Test Battery. Confidential report,
Johannesburg, CSIR, NIPR, 1974.



Welch, W. W., & Walberg, H. J. Pretest and sensitization effects in curriculum
evaluation. American Educational Research Journal, November 1970, 7(4).



